Our agenda for today


Exploring the motivation behind running Spark on Kubernetes 


o Architecture, benefits, our background with it 


Harnessing the potential of spot instances for substantial cost savings 


o Running driver on on-demand nodes, picking best spot markets (AZ, instance type)


Strategies for gracefully handling spot instance interruptions


o Executor decommissioning and PVC reuse when executor is lost 


Conclusion - Future work and best practices with Spark on Kubernetes


The motivation behind running 
Spark on Kubernetes 


Apache Spark is the #1 analytics engine for Big Data & AI


Fast
Massively parallelizable, efficient read and write


Easy
Interfaces with well-known programming languages


Versatile
Across multiple use cases


[Figure: Spark and its use cases: ETL/ELT pipelines, data warehouses, streams, object stores, real-time analysis, ML and data science, SQL/NoSQL databases, BI and data exploration]


The role of the resource manager in a Spark cluster


Spark depends on a cluster manager to orchestrate a job on a cluster.


[Figure: the Driver Program (SparkContext) connects to the Cluster Manager, which manages Executors on Worker Nodes]


Kubernetes is the latest cluster manager for Spark


Standalone: built-in, limited functionalities
Apache Mesos: deprecated as of Spark 3.2.0
Hadoop YARN: most widely used today
Kubernetes: most popular among new deployments


The Spark on Kubernetes Journey 


Feb 2018 - Spark 2.3: Initial support released for Spark on Kubernetes
Nov 2018 - Spark 2.4: Client Mode, Volume Mounts, PySpark and R support
June 2020 - Spark 3.0: Dynamic Allocation, Local code upload, Kerberos Support
March 2021 - Spark 3.1: Spark on Kubernetes generally available, Graceful node shutdown, NFS mounts, Dynamic Persistent Volume Claims
Oct 2021 - Spark 3.2: Dynamic PVC mounting and reuse, Faster S3 Writes (Magic Committer enabled)
June 2022 - Spark 3.3: Executor Rolling in Kubernetes environment, Support Customized Kubernetes Schedulers
Apr 2023 - Spark 3.4: PVC-oriented executor pod allocation
Sep 2023 - Spark 3.5: Upgrade kubernetes-client


Spark on YARN: architecture & pain points


[Figure: YARN Resource Manager and Worker Nodes]


Global Spark version and shared libraries
•  You'll have a Spark 2.4 cluster, a Spark 3.0 cluster, a Spark 3.1 cluster.
•  Transient clusters are recommended for stability, but increase costs.


Limited Docker image support *
•  Environment is built from AMIs and bash scripts, with flaky runtime library installation
•  Debugging is painful: there's no way to run Spark locally, and the environment is subtle


Resource Overhead
•  Slow startup time
•  The YARN master node and the YARN Node Manager are JVM processes that use a lot of resources.


Spark on Kubernetes: architecture & benefits


[Figure: Spark on Kubernetes submission flow inside a Kubernetes cluster. Components: Kubernetes master (API Server, Scheduler), Spark driver pod, Spark executor pods]
1. Spark-submit
2. Start driver pod
3. Request executor pods
4. Schedule executor pod
5. Notify of new executor
6. Schedule tasks on executors


Native Dockerization
•  Simpler dependency management
•  Reliable executions across environments (locally during development, staging, production)
•  Faster startup time


A single long-running cluster
•  Quick to scale up (and down) based on load
•  Mix different Spark versions
•  Mix Spark and non-Spark apps
•  Mix use cases (notebooks, batch/streaming jobs)


A standard, agnostic infrastructure layer
•  Reduce lock-in
•  Simplify your operations
•  Leverage the open-source tools from the cloud-native ecosystem
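

As a rough illustration of the first step of the submission flow on this slide, the sketch below assembles a spark-submit command line for cluster mode on Kubernetes. The API server URL, container image, and application path are placeholder assumptions, not values from this talk, and the helper name is hypothetical.

```python
from typing import List


def build_spark_submit(api_server: str, image: str, app: str, executors: int) -> List[str]:
    """Assemble a spark-submit argv for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix selects Kubernetes as the cluster manager.
        "--master", f"k8s://{api_server}",
        # Cluster mode: the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Number of executor pods the driver will request.
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]


# Hypothetical values, for illustration only.
cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",
    image="example.registry/spark:3.4.1",
    app="local:///opt/spark/examples/src/main/python/pi.py",
    executors=4,
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact when it is handed to a process launcher.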


The Pros & Cons of Spark on Kubernetes


The pros


•  Better dev experience with Docker.
•  An ecosystem of cloud-native tools.
•  Effective resource sharing enabling significant savings on cloud costs.
•  k8s can be the standard infrastructure layer across your entire stack: flexible, cloud- and vendor-agnostic


The cons


•  Data teams should not have to become Kubernetes experts.
•  Kubernetes introduces powerful but complex abstractions, and requires the maintenance of many components.
•  The support for Spark-on-Kubernetes on leading Spark products is absent or bare-bones.


[Figure: TPC-DS benchmark, YARN vs Kubernetes. Three panels show query time (median/min/max) for a network-shuffle-intensive query (q64-v2.4), a CPU-intensive query (q70-v2.4), and an I/O-intensive query (q82-v2.4). A fourth chart, "TPC-DS Running Time Per Query", plots duration (s) against query index.]




aws.amazon.com/fr/blogs/containers/optimizin erformance-on-kubernetes 


Ocean for Apache Spark


Developer friendly, continuously scaled & optimized Spark on k8s


[Figure labels: Storage - customer cloud account; Spot.io backend; Optimization logic; Driver, Spark 3.1]


On a serverless & continuously optimized infrastructure


Infrastructure provisioning and pricing
•  Autoscaling k8s cluster
•  Optimize instances based on the spot


Spark-aware autoscaling
•  Each Spark application dynamically scales up and down based on the load.
•  Gracefully handle spot kills & node shutdown.


History-based Optimizations of Spark Configs
•  Optimization based on the historical workload characteristics.


Harnessing the potential
of spot instances for
substantial cost savings


Spot instances


Up to 90% cheaper than their on-demand counterparts


•  Available on AWS, GCP, and Azure
•  Availability is not guaranteed
   o When you ask to launch a spot VM, the cloud provider can deny this request
   o Once a spot VM is launched, it can be reclaimed at any time and at short notice
•  Spot price varies in real time
   o Based on supply & demand
   o Across 100s of independent spot markets:
      ■ Cloud region
      ■ Availability zone within the region
      ■ Instance type


[Figure: Example of AWS spot instance price history (one instance type, various AZs, 3 months), showing price and average savings for us-west-2a, us-west-2b, us-west-2c, and us-west-2d]


How does Spark cope with spot interruptions?


Best practice: Spark driver should run on an on-demand node


If you lose the Spark driver:
•  The Spark app abruptly fails, and must be restarted from scratch.


If you lose a Spark executor, the app will have to recompute:
•  The tasks which were in progress when the executor died
•  Shuffle files: output of previous tasks stored on the executor
•  Cached data


[Figure: Shuffle files illustration: executors perform a shuffle write, then executors perform a shuffle read]


Best practice: run driver OD, execs on Spot 


You can achieve this in k8s using node selectors


•  Example on AWS (EKS) using cluster-autoscaler
   o Define node labels and Auto Scaling group tags

      Node label              Auto Scaling group tag
      lifecycle: spot         k8s.io/cluster-autoscaler/node-template/label/lifecycle: spot
      lifecycle: ondemand     k8s.io/cluster-autoscaler/node-template/label/lifecycle: ondemand

•  Add the relevant node selectors to your pod specs:
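

One way to express these selectors, sketched below, is through Spark configuration rather than hand-written pod specs. This assumes Spark 3.3 or later, where the per-role settings `spark.kubernetes.driver.node.selector.[labelKey]` and `spark.kubernetes.executor.node.selector.[labelKey]` should be available; on older versions a pod template is the usual alternative. The sketch reuses the `lifecycle` label from the example above, and the helper names are hypothetical.

```python
from typing import Dict, List


def lifecycle_selectors(label_key: str = "lifecycle") -> Dict[str, str]:
    """Spark confs that pin the driver to on-demand nodes and executors to spot nodes."""
    return {
        f"spark.kubernetes.driver.node.selector.{label_key}": "ondemand",
        f"spark.kubernetes.executor.node.selector.{label_key}": "spot",
    }


def to_conf_args(confs: Dict[str, str]) -> List[str]:
    """Turn a conf dict into repeated '--conf key=value' spark-submit arguments."""
    args: List[str] = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


selector_args = to_conf_args(lifecycle_selectors())
print(selector_args)
```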


This is what your cluster may look like


[Figure: m5.large on-demand instances (2 cores each) and r5.xlarge spot instances (4 cores each) hosting 4-core executor pods]


{
  "driver": {
    "instanceSelector": "m5.large",
    "cores": 2,
    "spot": false
  },
  "executor": {
    "instanceSelector": "r5.xlarge",
    "cores": 4,
    "instances": 6,
    "spot": true
  }
}


But this isn't very stable yet


•  If the r5.xlarge instance isn't available in the spot market, your executors will be stuck in pending state, and your app won't run (potentially for hours)
•  You may lose all your Spark executors at once (which makes recovery harder).


Solution: pick the best possible spot market 


Best availability zone, best instance type, fallback to On Demand instances 


[Screenshot: "Instance Availability" settings, with fields for Minimum Instance Lifetime (options include 3hrs, 6hrs, 12hrs), Spot Market Scoring, Preferred Availability Zones (us-west-2a), Preferred Spot Types (m4.large), and Draining Timeout (seconds)]
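

To make the idea of picking the best spot market concrete, here is a minimal sketch. It is not the scoring logic of any product: the `SpotMarket` fields, the interruption-rate metric, the threshold, and all sample values are hypothetical. It keeps only markets under an interruption threshold, prefers the most stable one (cheapest on ties), and returns nothing when no market qualifies so the caller can fall back to on-demand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpotMarket:
    az: str
    instance_type: str
    price: float               # $/hour (hypothetical values below)
    interruption_rate: float   # estimated interruption probability, 0..1 (hypothetical metric)


def pick_market(markets: List[SpotMarket],
                max_interruption_rate: float = 0.10) -> Optional[SpotMarket]:
    """Return the most stable eligible market, or None to signal an on-demand fallback."""
    eligible = [m for m in markets if m.interruption_rate <= max_interruption_rate]
    if not eligible:
        return None
    # Lowest interruption rate first; price breaks ties.
    return min(eligible, key=lambda m: (m.interruption_rate, m.price))


markets = [
    SpotMarket("us-west-2a", "m5.xlarge", 0.082, 0.15),
    SpotMarket("us-west-2b", "m5.xlarge", 0.069, 0.05),
    SpotMarket("us-west-2b", "m5.2xlarge", 0.140, 0.03),
]
best = pick_market(markets)
print(best if best is not None else "fall back to on-demand")
```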


This is what your cluster may look like


[Figure: r5.2xlarge spot instances (8 cores each), r5.8xlarge spot instances (32 cores each), and an m5.large on-demand instance hosting a driver pod (1 core)]


Blue Application
{
  "driver": {
    "instanceSelector": "m5",
    "cores": 1,
    "spot": false
  },
  "executor": {
    "instanceSelector": "r5",
    "cores": 4,
    "spot": true,
    "instances": 8
  }
}


Orange Application
{
  "driver": {
    "instanceSelector": "m5",
    ...
  },
  "executor": {
    "instanceSelector": "r5",
    "instances": 10
  }
}


Limitation: Avoid cross-AZ data transfer


Co-locate all pods of a given Spark app in the same AZ


•  The AZ selection should be done once, upon Spark application submission, so that the driver and executor pods all go to the same AZ.
•  Otherwise, you will suffer from cross-AZ data transfer:
   o Which hurts shuffle performance significantly
   o And cloud providers charge a fee for this
•  The additional flexibility granted by spreading executors across multiple AZs is not worth the penalty of cross-AZ transfer.
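

A minimal sketch of deciding the AZ once at submission time is shown below. It relies on `spark.kubernetes.node.selector.[labelKey]`, which applies to both driver and executor pods, and on the well-known `topology.kubernetes.io/zone` node label. The helper names and the "first candidate wins" policy are placeholders, not the selection logic described in this talk.

```python
from typing import Dict, List

# Well-known Kubernetes node label carrying the availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def same_az_confs(az: str) -> Dict[str, str]:
    """Conf that constrains both the driver and the executors to a single AZ."""
    return {f"spark.kubernetes.node.selector.{ZONE_LABEL}": az}


def submit_confs(candidate_azs: List[str]) -> Dict[str, str]:
    """Choose the AZ once, before submission, and reuse it for every pod of the app."""
    # Placeholder policy: take the first candidate. A real policy would rank AZs
    # by spot availability and price.
    chosen = candidate_azs[0]
    return same_az_confs(chosen)


az_confs = submit_confs(["us-west-2b", "us-west-2a"])
print(az_confs)
```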
We ran an experiment to measure the impact


Under 2 different configurations


Same test Spark workload:


•  1 driver (1 core, on-demand), 10 executors (4 cores each, spot)
•  Each Spark executor task consists of sleeping for 55 minutes
   o So that the application runs in about one hour if no spot interruption occurs.


Run every hour for 2 weeks during business hours (9-5) under 2 settings:


•  Static: availability zone hardcoded, instance type hardcoded (m5.xlarge)
•  Optimized: AZ flexible, instance type flexible within the m5 family


Experiment results


Spot market optimization avoids 79% of spot interruptions


                                      # of Spark apps   # of execs   Avg # of spot kills   Avg app duration
                                      that ran          launched     per application
Static configuration (m5.xlarge)      139*              1817                               1 hour 20 min (+20 min vs ideal)
Optimized configuration (m5 family)   147               1567                               1 hour 5 min (+5 min vs ideal)


* Sometimes, the applications with a hardcoded configuration to run on m5.xlarge spot instances did not run
at all, due to a lack of spot node availability.
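

The 79% figure can be sanity-checked from the counts above. The sketch below is a back-of-the-envelope estimate, not the figures reported on the slide: it assumes every application requests exactly 10 executors and that each additional executor launched replaces one that was spot-killed.

```python
EXECUTORS_PER_APP = 10  # from the test workload: 10 executors per application


def kills_per_app(apps_ran: int, execs_launched: int) -> float:
    """Estimate spot kills per app as the executors launched beyond the initial 10."""
    replacements = execs_launched - apps_ran * EXECUTORS_PER_APP
    return replacements / apps_ran


static = kills_per_app(apps_ran=139, execs_launched=1817)
optimized = kills_per_app(apps_ran=147, execs_launched=1567)
reduction = 1 - optimized / static

print(f"static: {static:.2f}, optimized: {optimized:.2f}, reduction: {reduction:.1%}")
# prints: static: 3.07, optimized: 0.66, reduction: 78.5%
```

Under these assumptions the estimate lands at roughly 79%, consistent with the headline number.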


How to handle spot 
instance interruptions


Since Spark 3.2: Graceful Exec Decommissioning


•  Before interrupting a spot instance, cloud providers give a notice:
   o Termination notice: 2 min on AWS, 30s on GCP, 30s on Azure
   o This signal can be intercepted by a NodeTerminationHandler (k8s DaemonSet)
   o The DaemonSet then sends a message to the executor, which sends a message to the driver


•  The driver then does the following:
   o Stop scheduling tasks on the executor which is going to go away
   o Do not count task failures on this executor against the maximum number of retriable failures
   o Move the shuffle files and cached data from this executor to another executor
   o Update the state of shuffle file locations accordingly
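

The driver-side steps above can be pictured with a toy model. This is an illustration of the bookkeeping only, not Spark's actual implementation: the `ToyDriver` class, its fields, and the round-robin placement are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyDriver:
    """Toy model of the driver's decommissioning bookkeeping (not Spark internals)."""
    shuffle_location: Dict[str, str]   # shuffle block id -> executor id
    executors: Set[str]
    decommissioning: Set[str] = field(default_factory=set)

    def schedulable(self) -> List[str]:
        """Executors that may still receive new tasks."""
        return sorted(self.executors - self.decommissioning)

    def counts_as_failure(self, exec_id: str) -> bool:
        """Task failures on a decommissioning executor do not count toward the retry limit."""
        return exec_id not in self.decommissioning

    def on_decommission_notice(self, exec_id: str) -> None:
        # Stop scheduling new tasks on the executor that is going away.
        self.decommissioning.add(exec_id)
        peers = self.schedulable()
        if not peers:
            return  # nowhere to migrate; these blocks would have to be recomputed
        # Move its blocks to the remaining executors (round-robin) and record
        # the new location of each block.
        blocks = sorted(b for b, e in self.shuffle_location.items() if e == exec_id)
        for i, block in enumerate(blocks):
            self.shuffle_location[block] = peers[i % len(peers)]


driver = ToyDriver(
    shuffle_location={"s0": "exec-1", "s1": "exec-1", "s2": "exec-2"},
    executors={"exec-1", "exec-2", "exec-3"},
)
driver.on_decommission_notice("exec-1")
print(driver.schedulable(), driver.shuffle_location)
```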


[Figure: animation over three nodes, each running a Termination Handler. Executor 1 and Executor 2 each hold shuffle and cache data; the Spark Driver runs on Node 3. In the final frame, Executor 2 holds the shuffle and cache data of both executors.]


1. The Termination Handler notices that the node is going to be spot-killed in 120 seconds.
2. The Spark driver blacklists exec 1. New tasks are not scheduled on it anymore. When a task running on exec 1 fails, it does not count against max # of failures.
3. The Spark application can continue with minimal impact from the node / executor loss. We didn't lose any shuffle or cached data!


Graceful Exec Decommissioning


How to enable this feature:
•  Install the NodeTerminationHandler for your cloud provider as a k8s DaemonSet
•  Turn on the following configuration flags:
   spark.decommission.enabled
   spark.storage.decommission.rddBlocks.enabled
   spark.storage.decommission.shuffleBlocks.enabled
   spark.storage.decommission.enabled
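

As a small sketch of how the four flags above might be passed at submission time, the snippet below turns them into `--conf` arguments for spark-submit. The flag names come from the slide; the helper function is hypothetical.

```python
from typing import Dict, List

# The four flags listed above, all switched on.
DECOMMISSION_CONFS: Dict[str, str] = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}


def to_submit_args(confs: Dict[str, str]) -> List[str]:
    """Turn a conf dict into repeated '--conf key=value' spark-submit arguments."""
    args: List[str] = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


decommission_args = to_submit_args(DECOMMISSION_CONFS)
print(" ".join(decommission_args))
```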


Limitations:
•  Very large shuffle files may not have enough time to be migrated
•  If many executors get spot-killed at the same time, we may lose some shuffle files
   o For example, the driver learns that exec-1 is decommissioned, decides to move files to exec-2, and then the driver learns that exec-2 is decommissioned too
   o This gives another argument in favor of spreading executors across multiple spot markets ("do not put all your eggs in one basket")


Graceful Exec Decommissioning - Experiment


•  We ran workloads that produced a significant amount of shuffle files on the executors.
•  We simulated the process of receiving a spot kill
•  Using the Storage tab in the Spark UI and parsing information from the driver and executor logs, we could measure
   o The data stored on the executors prior to detachment
   o The data moved from one executor to another during decommissioning
   o The time Spark spent moving files
•  We tested with 4-core executors across different instance types (m5, m5d, i3, ...)


On average, Graceful Executor Decommissioning moved ~15GB/minute of shuffle data on
regular instances, and 35-40GB of data/minute on SSD-backed instances.
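

To put these rates in perspective, the sketch below multiplies them by the termination notice periods mentioned earlier (2 minutes on AWS, 30 seconds on GCP and Azure). It assumes the whole notice window is spent migrating data at the average rate, so the results are optimistic upper bounds rather than measurements.

```python
def migratable_gb(rate_gb_per_min: float, notice_seconds: float) -> float:
    """Upper bound on data moved if the entire notice window is spent migrating."""
    return rate_gb_per_min * notice_seconds / 60


rates = [("regular", 15), ("SSD-backed (low)", 35), ("SSD-backed (high)", 40)]
notices = [("AWS", 120), ("GCP/Azure", 30)]

for instance_label, rate in rates:
    for cloud, notice in notices:
        gb = migratable_gb(rate, notice)
        print(f"{instance_label:18s} {cloud:10s} up to {gb:.1f} GB")
```

For example, a regular instance on AWS could move up to about 30 GB within the notice window under this assumption, which is why very large shuffle files may not be fully migrated.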


Since Spark 3.2: Executor PVC Reuse


•  Since Spark 3.1, it's possible to configure Spark to dynamically provision and mount Persistent Volume Claims (PVCs).
   o But the PVC and the executor share the same fate.


•  As of Spark 3.2, PVCs mounted onto executors can survive the removal of their original executor, to be mounted on a new executor instead.
   o This means the shuffle files can be recovered after a spot kill, or even another failure (such as an OutOfMemory error).


[Figure: executor PVC reuse: shuffle files stored on a persistent volume claim, shown across nodes]


Executor PVC Reuse 


How to enable this feature:
•  Configure dynamic PVCs (see open-source documentation, there are many possibilities)


•  Turn on the following configuration flags:
   spark.kubernetes.driver.reusePersistentVolumeClaim *
   spark.kubernetes.driver.ownPersistentVolumeClaim
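

One possible configuration, sketched below, combines an on-demand (dynamically provisioned) PVC for executor local storage with the two reuse flags. The `spark.kubernetes.executor.volumes.persistentVolumeClaim.*` keys follow the pattern in the open-source Spark documentation as I understand it; the storage class, size, and mount path are placeholder assumptions to adapt to your cluster, and the helper name is hypothetical.

```python
from typing import Dict


def pvc_reuse_confs(volume_name: str = "spark-local-dir-1",
                    storage_class: str = "gp2",       # placeholder storage class
                    size: str = "100Gi",              # placeholder volume size
                    mount_path: str = "/data") -> Dict[str, str]:
    """Confs for a dynamically provisioned executor PVC plus driver-side PVC reuse."""
    prefix = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{volume_name}"
    return {
        # "OnDemand" asks Spark to create one PVC per executor.
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size,
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
        # The driver owns the PVCs and may hand them to replacement executors.
        "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
        "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
    }


pvc_confs = pvc_reuse_confs()
for key, value in pvc_confs.items():
    print(f"--conf {key}={value}")
```

A volume name starting with `spark-local-dir-` is used here so that Spark treats the volume as local scratch space, which is where shuffle files are written.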


Limitations:
•  This feature is not compatible with using local NVMe-based SSDs for shuffle files (PVCs are typically backed by remote volumes such as EBS)
   o Local NVMe-based SSDs offer a 5-10x performance improvement for shuffle-heavy workloads


•  * In our tests, a race condition sometimes causes a PVC not to be reused immediately, so the shuffle file recovery does not work every time:
   o spark.kubernetes.driver.waitToReusePersistentVolumeClaim=true (since Spark 3.4)


•  This feature requires a bit more configuration than the graceful decommissioning feature.


Conclusion: Future work
and best practices for
Spark on k8s


How to make Spark run reliably on spot VMs


Substantial cost savings without trading off performance or stability


•  Driver should run on on-demand, executors on spot
•  Optimize the spot market to avoid spot kills
   o Pick the best AZ (and use it for the entire app)
   o Spread executors across multiple spot instance types, based on real-time spot market dynamics
•  Gracefully handle spot kills by proactively moving shuffle files when a spot termination occurs


What's new in Spark 3.4 for Spark-on-k8s


•  [SPARK-41515] PVC-oriented executor pod allocation


[Screenshot: Apache Spark JIRA issue SPARK-41515, "PVC-oriented executor pod allocation"]
Type: New Feature. Priority: Major. Affects Version/s: 3.4.0. Component/s: Kubernetes. Labels: releasenotes.
Assignee: Dongjoon Hyun. Reporter: Dongjoon Hyun.
Created: 14/Dec/22 05:43. Updated: 13/Sep/23 08:13. Resolved: 16/Dec/22 17:31.

Issue links:
   o is related to SPARK-44993: Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums
   o is related to SPARK-41559: Dynamic Allocation on K8S GA
   o relates to SPARK-44945: Validate checksum of remounted PVC's shuffle data before recovery

Sub-tasks (all resolved):
   1. Support PVC-oriented executor pod allocation
   2. getReusablePVCs should ignore recently created PVCs in the previous batch
   3. getReusablePVCs should handle accounts with no PVC permission
   4. recoverDiskStore should not stop by existing recomputed files
   5. Upgrade kubernetes-client to 5.12.3
   6. Add "PVC-oriented executor pod allocation" section and revise config name
   7. Improve onNewSnapshots to use unique list of known executor IDs and PVC names
   8. Skip PVC cleanup when driver doesn't own PVCs
   9. Show a directional error message for PVC Dynamic Allocation Failure


What's new in Spark 3.4 and above for Spark-on-k8s


•  [SPARK-45270] Custom k8s Scheduler support (Volcano)
   o Enable YARN-like capabilities such as queues, gang scheduling, etc.


[Logos: Volcano (Kubernetes Native Batch System), Apache YuniKorn]


[Screenshot: Apache Spark JIRA issue SPARK-45270, Upgrade "Volcano" to 1.8.0]
Type: Improvement. Priority: Major. Resolution: Fixed. Affects Version/s: 4.0.0. Fix Version/s: 4.0.0. Component/s: Kubernetes.
Assignee: Dongjoon Hyun. Reporter: Dongjoon Hyun.
Issue resolved by pull request 43050: https://github.com/apache/spark/pull/43050


What's new in Spark 3.4 and above for Spark-on-k8s


[Screenshot: Apache Spark JIRA issue SPARK-44951, "Improve Spark Dynamic Allocation"]
Type: Improvement. Status: Open. Priority: Major. Resolution: Unresolved. Affects Version/s: 4.0.0. Fix Version/s: None. Component/s: Kubernetes, Spark Core, YARN. Labels: None.
Assignee: Unassigned. Reporter: Holden Karau. Shepherd: Holden Karau.
Created: 24/Aug/23 19:01. Updated: 24/Aug/23 19:01.

Description:
For Spark 4 we should aim to improve Spark's dynamic allocation. Some potential ideas here include the following:
* Pluggable DEA algorithms
* How to reduce wastage on the RM side? Sometimes the driver asks for some units of resources. But when RM provisions them, the driver cancels it.
* Support for "warm" executor pools which are not tied to a particular driver but start and wait for a driver to connect to them to "claim" them.
* More explicit Cost vs AppRunTime configuration: a good DEA algo should allow the developer to choose between cost and runtime. Sometimes developers might be ok to pay higher costs for faster execution.
* Use previous run information to inform future runs
* Better selection of executors to be scaled down

Sub-tasks (all open, unassigned):
1. Log a warning (or automatically disable) when shuffle tracking is enabled along side another DA supported mechanism
2. Make DEA algorithms pluggable
3. Add the option for dynamically marking containers for preemption based data


Thank you


Ocean for Apache Spark


Hichem Kennich
Data & AI Product Architect


